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A spectral-element framework is presented for the simulation of subsonic compressible 
high-Reynolds-number flows. The focus of the work is maximizing the efficiency of the 
computational schemes to enable unsteady simulations with a large number of spatial and 
temporal degrees of freedom. A collocation scheme is combined with optimized compu- 
tational kernels to provide a residual evaluation with computational cost independent of 
order of accuracy up to 16th order. The optimized residual routines are used to develop a 
low-memory implicit scheme based on a matrix-free Newton-Krylov method. A precondi- 
tioner based on the finite-difference diagonalized ADI scheme is developed which maintains 
the low memory of the matrix-free implicit solver, while providing improved convergence 
properties. Emphasis on low memory usage throughout the solver development is lever- 
aged to implement a coupled space-time DG solver which may offer further efficiency gains 
through adaptivity in both space and time. 

I. Introduction 

The Variational Multiscale Method (VMM) is a reformulation of the classical Large-eddy Simulation 
(LES), where the filtering operation, used to explicitly separate resolved and unresolved scales, is replaced 
by a finite-element projection operator. * 1 As projection and differentiation commute, VMM does not suffer 
from commutation errors present in classical LES. VMM subgrid-scale models can be limited to the finest 
computed modes, so that no modeling error is introduced into the resolved coarser modes. 2 Combined with 
adaptivity, this provides a robust h-p spectral-element framework for viscous simulations which has shown 
promise for predicting attached wall-bounded flows. 1 ’ 3-8 

The ultimate objective of this work is to apply the VMM to compressible, high-Reynolds number sep- 
arated flows where Direct Numerical Simulation (DNS) is impractical, and engineering Reynolds-Averaged 
Navier-Stokes (RANS) turbulence models struggle to accurately capture the important flow features. Specif- 
ically, our initial target is unsteady simulation of the Fundamental Aeronautics Investigation of The Hill 
(FAITH) experiment, shown in Fig. 1. The FAITH experiment includes a number of quantitative and qual- 
itative measurement techniques to characterize the separated flow over an axisymmetric hill at Reh = 500k, 
where h is the height of the hill. VMM simulations will complement this experimental dataset with detailed 
statistical measurements of the turbulence correlations and budget terms necessary to develop and validate 
novel RANS models capable of predicting separated flows. 

Simulation of the FAITH experiment with a wall-resolved VMM poses a significant computational chal- 
lenge due to the high Reynolds number. To estimate the total cost, we consider a computational domain 
including the majority of the wind-tunnel test section, with the wall-normal direction resolved to within 
one friction length scale at the wall, i.e. A y + \ w = 1. Using the VMM, we estimate a requirement for the 
average streamwise and spanwise resolution of Aa; + = A z + ~ 20 respectively. To accurately propagate the 
resolved turbulent fluctuations an appropriate unsteady time scale is required, which corresponds roughly to 
a CFL=1 based on the r.m.s. mean velocity. With this temporal resolution, we bookkeep a simulation time 
of 10 flow-through periods, which is a minimum to capture statistical measurements. From this, we estimate 
a spatial computational mesh requiring on the order of 10 9 degrees of freedom (DOF), and a simulation time 
requiring 10 9 timesteps, for a total of 10 18 space-time DOF. 
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Given these computational cost estimates, the efficiency of a VMM solver is critical to enabling FAITH 
simulations. After surveying the open-source community, NASA, and NASA’s collaborative partnerships, 
none of the available options for an existing spectral-element framework were deemed sufficient, or could be 
easily modified, to provide the necessary efficiency and accuracy. From this, a decision was made to develop 
a VMM spectral-element solver specifically for low-speed, high-Reynolds number flows over geometrically 
simple configurations. The goal of this paper is to describe the design of this framework in terms of algorithm 
choice, data layout in memory, and tightly-coupled optimizations for the target computer hardware. Specifi- 
cally, we seek to minimize three weaknesses of high-order finite-element formulations - the high mathematical 
operation count per element, the restrictive stability constraint for explicit methods, and the large memory 
required for efficient implicit methods. The current work focuses solely on these numerical challenges, with 
closure modeling and accurate reproduction of the relevant physics deferred to future discussions. 

The paper begins with an introduction to the nomenclature and basic spectral-element formulation. From 
this, algorithm and computational optimizations are discussed along with representative computational tim- 
ings. A matrix-free implicit method is described, along with a novel method for incorporating a diagonalized 
Alternating-Direction-Implicit (ADI) preconditioner. Lastly, a space-time formulation is presented which 
effectively leverages spectral elements in space and time to reduce the computational cost. Two model 
problems, convection of an isentropic vortex and the Taylor-Green vortex evolution, are used throughout to 
provide computational examples. The paper concludes with a summary of the major points and some topics 
for future work. 


II. Spatial Discretization 

The compressible Navier-Stokes equations are written in conservative form as: 

+ V • (F 1 — F v ) = 0 (1) 

The conservative state vector is u = [p, pv, pE\, where p is the density, v the velocity, and E the total energy. 
The inviscid and viscous fluxes are given, respectively, by: 
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where p is the static pressure, H = E + - is the total enthalpy, t is the shear stress tensor, kt is the thermal 
conductivity, T = p/ pR is the temperature, and R is the gas constant. The pressure is given by: 

P = (7 - !) (pE ~ §P V ■ v ) . (3) 

where 7 is the specific heat ratio. The viscous stress tensor, r, is given by: 

r = p (Vv + Vv T ) - A(V • v)J (4) 
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where /t is the viscosity, A = |/t is the bulk viscosity and I is the identity matrix. 

In this work, we use a Galerkin finite-element method, which extends to arbitrary order of accuracy. 
Higher-order methods show potential for simulations requiring high spatial and temporal resolution, allowing 
for solutions with fewer degrees of freedom and lower computational cost than traditional second order CFD 
methods. 10 As this work is focused on subsonic flows without shocks, the exact solution is in C°° , and thus 
we do not expect the convergence rate of our higher order scheme to be limited by solution irregularity. 
Higher-order finite element methods are particularly attractive due to the possibility of using local h- and p- 
adaptation. Galerkin finite-element methods have an increased number of operations, relative to high-order 
finite-difference methods, due to the increased coupling between degrees of freedom. High-order finite- 
difference schemes couple degrees of freedom along lines in each coordinate direction, so that the cost of a 
residual evaluation scales as N x d for a fixed number of degrees of freedom, where N = p + 1 is the order 
of accuracy and d the spatial dimension. A general Galerkin method couples all degrees of freedom on an 
element, leading to a cost that scales as N d . This scaling suggests that in three dimensions, a general Galerkin 
formulation is penalized in terms of operation count relative to finite-difference methods beyond third order. 
In order to attain similar efficiency within a Galerkin framework, we use tensor product bases and a sum 
factorization approach, such that differentiation and integration reduce to a sequence of one-dimensional 
operations. This results in a cost which also scales as N x d. 

The domain, H, is partitioned into non-overlapping hexahedral elements n. We define a finite-element 
space Vh consisting of piece-wise polynomial functions on each element k: 

V h = {w,w\ K e [l } (5) 

where m = 5 is the number of flow equations. Choosing Vh to be C°-continuous across elements leads to 
a continuous Galerkin (CG) discretization, while allowing Vh to be discontinuous across elements leads to 
a discontinuous Galerkin (DG) discretization. We have developed a generic finite-element framework using 
both CG and DG formulations, however, the results presented in this work focus on DG to clarify the 
discussion. 

We seek a solution u GVh which satisfies the weak form of Eq. (1): 

J2{J k - v™- (F I -F v ) S j +J w{F-F V )-n j=0 WwGV h . (6) 

~ i ~ v 

Here F and F denote numerical fluxes which are functions of the state on both sides of a face shared by 
two elements, or state and boundary data for a face on the domain boundary. The inviscid flux is computed 
using the Roe flux, 11 while the viscous flux is computed using the method of Bassi and Rebay. 12 In a CG 
method, the surface integrals corresponding to faces on the interior of the domain cancel to zero, leaving 
only surface integrals on the boundary of the domain. 

Using a tensor product bases on each element, the solution u is given by a product of Lagrange polyno- 
mials: 

1t(x(£)) = ^ ijk&ijk ^ ijk = (£ 3 ) (7) 

i,j,k 

where x(£) defines a mapping from the reference cube, £ C [—1, l] 3 * , to physical space. 4>i is a one-dimensional 
Lagrange basis defined at Gauss-Legendre (GL) or Gauss-Legendre-Lobatto (GLL) points, while U ijk is the 
corresponding nodal value of the solution. The integrals in Eq. (6) are evaluated using numerical quadrature. 
For volume integrals: 

f ( w ^- -Vw ■ {F 1 - F v )\ ~ {( w ^ ~ • (F 1 - F l )\ |J|j w p w q w r (8) 

d K A / p,q,r '-A J ) £p£q£ r 

where £ P ,£ 9 ,£r are one-dimensional GL or GLL quadrature points, and w p , w q and w r are the associated 
quadrature weights. J denotes the Jacobian of the mapping from element reference space to physical space, 
denotes the gradient with respect to the local coordinate £, while F = J~ 1 F I and F = J -1 ^ 1 are the 
fluxes mapped to the local element coordinate system. Similarly, surface integrals in Eq. (6) are evaluated 
as: 

[ w(F T - F v )-n ~ V \w(F I - F v ) ■ n) w p w q (9) 

J dn p j(J ^ GG 
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Both volume and surface integrals (Eqs. (8) and (9)) are performed as a sequence of three steps described 
below: 

1. Evaluation of the state and gradient at the quadrature points. 

2. Evaluation of the fluxes at the quadrature points. 

3. Multiplication of the fluxes with the gradient of the basis functions. 

The sum-factorization approach 13 is used in steps 1 and 3. The state at a particular quadrature point is 
given by: 


Upqr — u (£,p: Cqo £r) — EEE Uijk ( t>i\£ p ( f > j\£ q (t > k\£ q (10) 

i j k 

The state evaluated at all quadrature points within an element is computed as: 

U = (B 3 ®B 2 ®B 1 )U = (B 3 ®I®I){(I®B 2 ®I){(I®I®B 1 )U}} (11) 

where U is vector of states at all quadrature points in an element, (B 3 ® B 2 ® B\) is the matrix formed as 
the tensor product of N q x N matrices B i, B 2 and B 3 , where N q is the number of 1-dimensional quadrature 
points, while N is the number of 1-dimensional basis functions. The entries of B i, B 2 and B 3 are: 


[B\]pi — [B2\qj — 4>j\^ q [B 3 ]rk — <fik\£ r (12) 

In Eq. (11), we show that the tensor product structure the multiplication of (B 3 ® B 2 ® B i), is performed 
as a sequence of small matrix multiplications corresponding to each coordinate direction. The derivative of 
the state with respect to £1 is computed at the quadrature points as: 

= ( B 3 ®B 2 ®D 1 )U = (B 3 ®I®I){{I®B 2 ®I){(I®I®D 1 )U }} (13) 

where the entries of the matrix D\ are given by: 

[Di] P i = d -^r\z P (14) 

Derivatives in the £ 2 and £ 3 directions are computed in a similar manner. 

The use of the sum factorization approach leads to an operation count which scales as N x d for fixed 
number of degrees of freedom. However, in order to further reduce the cost of our finite-element discretization 
we use a collocation approach to compute the integrals involved in the residual. When using a collocation 
approach, the solution points are used as quadrature points reducing the cost of the state and residual 
evaluations since operations involving the matrices B\, B 2 and B 3 are eliminated (2?i, B 2 and B 3 reduce to 
the identity matrix, and hence multiplication by these terms is not necessary). The state is stored at the 
collocation points and thus is directly available, while the derivatives are computed using terms of the form: 


au 

Wi 


(I ® I ® Di)U 


(15) 


The third step of the residual evaluation, requiring the weighting of the flux with the gradient of the basis 
functions, is computed in a similar manner. Namely, using collocation, we compute volume terms of the 
form: 


R k = ~{I®I®D l ) T F 1 -{I®D 2 ®I) T F 2 -{D 3 ®I®I) T F 3 (16) 


where Fi is the vector of fluxes scaled by the Jacobian determinant and quadrature weights at each point: 


F i 


pqr 


(FiW\) 


tptiqir 


(17) 


while F 2 and F 3 have similar forms. 
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With the use of collocation and tensor products the amount of computational work is reduced to a 
minimum, however we still have a high operation count per degree of freedom. We minimize this cost through 
code optimization. Recently, increases in computational performance have been largely achieved through 
increased parallelization, both through the use of systems with increasing number of processors, as well 
as on-processor parallelization through threading and vectorization. High-order finite-element methods are 
well suited to take advantage of parallel computing. The finite-element method provides a straight-forward 
domain-decomposition with relatively little coupling between elements consisting only of data on element 
interfaces. Thus, we take advantage of multi-processor parallelism with MPI-based message passing, where 
the small amount of communication overhead is masked by overlapping communication and computation. 
Additionally, operations local to an element are designed in such a way as to align data in a format well 
suited to vectorization. 

The finite-element framework developed here is designed for use with the computing resources at NASA 
Ames Research Center. In particular, we look to use the Intel Sandy Bridge and Intel Xeon Phi micro- 
processor architectures, as well as their future descendents, which achieve high-performance through single- 
instruction/multiple-data (SIMD) vector operations. The Sandy Bridge architecture uses the Advanced 
Vector Extensions (AVX) instruction set with a SIMD register of 256 bits (4 doubles), while the Xeon Phi 
has a SIMD register of 512 bits (8 doubles) allowing 4 or 8 double operations to be performed in parallel. 
In order to take full advantage of the available computing resources, it is essential to use the vectorizing 
capabilities of these micro architectures. In particular, efficient vectorization requires alignment of data and 
unit-stride memory operations. 

In order to achieve high computational efficiency, we have optimized the numerical kernels involved in the 
computation of the residual and reuse these kernels throughout our finite-element software. In particular, we 
use a set of optimized matrix-matrix multiplication routines for computing the sum-factorization terms in 
Eqs. (11), (14) and (16), while the evaluation of fluxes at collocation points is performed using data-aligned, 
unit-stride vector functions. 

The use of the sum-factorization approach and optimized routines allow for residual evaluations with a 
computational cost that is nearly independent of N for N = [2, 16]. The CPU time for a residual evaluation, 
normalized by the number of degrees of freedom for our DG discretization is shown in Fig. 2. The CPU time 
is obtained by performing a simulation of the Taylor-Green vortex problem described in Section III.B on a 
Sandy Bridge node of the Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node 
consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6Ghz and 2GB per core memory. 
The simulations were performed with approximately 30000 degrees of freedom per core. Through optimized 
linear algebra kernels, the increase in operation count with N is offset by more effective use of computing 
resources. For N = [4,16], the CPU time for a single residual evaluation is approximately 0.75 [is/ DOF 
which is comparable to the cost for a residual evaluation using the OVERFLOW finite-difference code on 
the same architecture. For reference, the TAU benchmark for this architecture runs in 7.6s. 10 



N 

Figure 2. CPU time for a single residual evaluation per degree of freedom on an Intel Sandy Bridge node 
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III. Model Problems 


Two model problems are used throughout to examine the behavior of the spectral-element framework. 
These are the inviscid convection of a two-dimensional isentropic vortex, and viscous simulation of the 
Taylor-Green vortex evolution. These problems are representative of the physics without containing solid 
boundaries, which would necessarily require sub-grid modeling for any appreciable Reynolds number. 

III. A. Convection of an Isentropic Vortex 

The inviscid isentropic vortex convection problem is initialized with a perturbation about a uniform flow 
given by: 


Su = 


(18) 


7* — nr- r 2 

(19) 

Sv = 

UoJ3 — — 2 

K 

II 

nTT^ 

1 7 u 2 B 2 e~ r2 
2 7 -l 

(20) 


where Uqo is the convecting speed of the vortex, /3 = 1/5 is the vortex strength, R = 0.05 the characteristic 
radius, and ( x c ,y c ) the vortex center. The convection velocity is angled 30° from the x-direction so that 
the flow does not align with the mesh. The exact solution to this flow is a pure convection of the vortex at 
speed Uoo. Figure 3 shows contours of the density after 5 periods for the simulation of the isentropic vortex 
at a Mach number of 0.5 using N = 2 and N = 16 using 64 degrees of freedom in x- and y- directions. 
Figure 3 demonstrates the benefits of higher-order methods as the 16th-order method is able to propagate 
the vortex through 5 periods nearly exactly. The quantitative estimate of formal accuracy will be described 
in Sec. V.A. 



(a) N = 2 (b) N = 16 

Figure 3. Density contours for isentropic vortex convection problem after 5 periods using N = 2 and N = 16 with 64 
DOF in each coordinate direction. 


III.B. Taylor-Green Vortex 

The Taylor-Green vortex flow is simulated using the compressible Navier-Stokes equations at Mq = 0.1. The 
flow is solved on an isotropic domain which spans [0, 2nL] in each coordinate direction. The initial conditions 
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are given by: 


u 

v 

w 

P 


Vq sin(x/L) cos (y/L) cos (z/L) 
—Vo cos (x/L) sin (y/L) cos (z/L) 
0 



1 

16 


(cos(2:r) + sin(2j/)) (cos(2.z) + 2)) 


(21) 

(22) 

(23) 

(24) 


where u,v and w are the components of the velocity in the x, y and 2 -directions, p is the pressure and p is 
the density. The flow is initialized to be isothermal (^ = = RTo). Figure 4 shows the temporal evolution 

of the kinetic energy dissipation rate — C ^ L for the Taylor-Green vortex problem computed at a M 0 = 0.1 
and Re = — — = 1600 using 256 degrees of freedom in each coordinate direction. Figure 4 also shows 
results using an incompressible spectral code with 512 degrees of freedom in each coordinate direction. 14 
Figure 5 show the corresponding iso-contours of vorticity at the point of peak dissipation using 2nd- and 
16th- order schemes. The higher-order scheme is less dissipative than the 2nd-order scheme and is able to 
more accurately match the spectral data. 



Figure 4. 


Taylor-Green vortex problem at M = 0.1, Re = 1600 computed using 256 3 degrees of freedom 


IV. Polynomial De-aliasing 

A VMM or LES method resolves only large scale motion in a flow to reduce the computational cost 
relative to DNS, with the unresolved finest scales modeled. While the coarsest scales are accurately resolved, 
the finest representable scales in the flow are under-resolved. These under-resolved flows cause numerical 
stability problems, especially for high-order numerical methods, and require some form of stabilization. 
There are many techniques for stabilization within a CG or DG spectral-element framework. 15-18 Here we 
examine polynomial de-aliasing within an element 18 (often referred to as “over-integration”) within a DG 
framework to demonstrate the issues involved and performance. Section VI discusses some more general 
issues for combining stabilization and physical modeling in a VMM for future work. 

The use of numerical quadrature implies inexact projection of the nonlinear products in the flux terms onto 
a finite-element space. This inexact projection causes aliasing errors and may ultimately lead to instability. 
Similar to de-aliasing techniques in spectral methods, projecting the state variables to a higher polynomial 
space to evaluate the nonlinear products, then projecting the result back to the original polynomial order, 
can remove these aliasing errors. 

For incompressible Navier-Stokes simulations, the nonlinear convection terms are quadratic in the state, 
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(a) N = 2 (b) N = 16 


Figure 5. Iso-contours of vorticity at peak dissipation for the Taylor-Green vortex evolution at M — 0.1, Re — 1600 
computed using 256 3 degrees of freedom. 


implying that for a solution with N = p+1, exact projection of the nonlinear terms may be performed using 
a Gaussian quadrature rule with 3/2 N points. 19 In the case of the compressible Navier-Stokes equations, the 
nonlinearity in the inviscid flux is a rational function of the conservative state and Gaussian quadrature is 
unable to exactly compute the projection of the Navier-Stokes equations onto the polynomial space. a It has 
been suggested that since the inviscid flux in the compressible Navier-Stokes equations is a cubic function 
of the primitive variables, using a quadrature rule with 2 N points is sufficient, 18,20 while others recommend 
a 3/2 TV quadrature. 21 Clearly these polynomial de-aliasing methods are somehwat ad-hoc for compressible 
flows. 

In order to understand the behavior of polynomial de-aliasing for subsonic compressible flows, we consider 
the errors in integration introduced through the use of quadrature. As a model problem, we consider a one- 
dimensional periodic domain using a single element with N = 16. We evaluate the spatial residual using 
increasing quadrature rules. The residual is computed about a flow with mean Mach number of 0.2 with 20% 
fluctuations in velocity and pressure. We consider flows with density fluctuations of 0, 5% and 20% about 
the mean. Figure 6 plots the maximum error in the residual for the energy equation for each mode using a 
Legendre polynomial basis, averaged over 100 random initial states. For flows with zero density fluctuation, 
using the 3/2N quadrature points allows for exact integration of the lowest quartile of modes, while Gaussian 
quadrature rules using 2 N or more points are exact for all modes. When density perturbations exist in the 
flow, then exact projection of the flux onto the polynomial space is not possible and an error is introduced 
for all modes. However, for density fluctuations on the order of 5% the use of 2N points is able to reduce 
this error by nearly 4 orders of magnitude. As the magnitude of the density fluctuations increases, additional 
quadrature is required to accurately integrate the rational functions. This accuracy (or lack thereof) appears 
to correlate with numerical stability, even in the presence of upwinding in the DG method. In particular, we 
have found that with increasing Mach and Reynolds number, polynomial de-aliasing with 2 N or 3N points 
is not always sufficient. 

We demonstrate the aliasing issue by performing a simulation of the Taylor-Green vortex evolution at a 
M = 0.1 and Re = 1600 using 64 DOF in each coordinate direction. In this computation it is known that 64 
DOF is under-resolved, however increasing resolution by only two or four times with a suitable high-order 
scheme leads to relatively good predictions, so this is representative of an under-resolved LES simulation. In 
Figure 7 we show the evolution of the kinetic energy dissipation rate using both collocation, and polynomial 
de-aliasing with 2 N points. Using collocation, for 2nd- and 4th- order the DG method provides sufficient 
numerical dissipation to maintain stability, however the higher-order simulations become unstable. Using 
polynomial de-aliasing with 2N quadrature points all of the simulations now remain stable, consistent with 
the experience in the literature. 20 

a Maintaining standard conservative form for the compressible equations, it is possible to formulate the Navier-Stokes flux 
terms purely as products of the primitive variables. In this situation multiplication by the inverse of the mass matrix leads to 
rational functions. 
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(a) No density variation (b) 5% density variation 



(c) 20% density variation 


Figure 6. Errors in evaluation of the residual using numerical quadrature as a function of N with 20% variation in 
velocity and pressure and density variation of a) 0, b) 5% c) 20% 



(a) Collocation (b) Polynomial de-aliasing 2 N 

Figure 7. Taylor-Green vortex problem at M = 0.1, Re = 1600 computed using meshes with 64 3 degrees of freedom using 
a) collocation, and b) polynomial de-aliasing with 2 N points 
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Increasing the Reynolds number to Re = 16k further reduces the available resolution, and numerical 
stability becomes difficult to maintain with either scheme. Figure 7 presents the kinetic energy dissipation 
rate using both collocation and polynomial de-aliasing with 2 N points for this higher Reynolds number. 
While polynomial de-aliasing allows the simulation to progress further in time than using collocation, the 
8th- and 16th-order simulations eventually become unstable. Increasing the de-aliasing to 3 N points is 
still insufficient to maintain stability for this test case. Given these results for a relatively benign problem, 
stabilization is a concern for our more complex target flow which has higher speed, higher Reynolds number, 
and larger perturbations. This is an area of planned future investigation. 
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(b) Polynomial de-aliasing 2N 


Figure 8. Taylor-Green vortex problem at M — 0.2, Re = 16000 computed using meshes with 64~ degrees of freedom 
using a) collocation, and b) polynomial de-aliasing with 2N points 


The use of polynomial de-aliasing with 2 N quadrature points in each coordinate direction increases the 
computational cost of a residual evaluation by roughly 10 times relative to collocation. For explicit time- 
stepping schemes the simulation cost is driven entirely by the cost of residual evaluations and hence the use 
of polynomial de-aliasing implies an order of magnitude increase in total computational cost. However, as 
we’ll see in the next section, the use of explicit schemes with our spectral-element framework is impractical 
for these high-Reynolds-number wall-bounded flows. For implicit time-stepping schemes, the cost of the 
simulation is primarily driven by the cost of the linear solver for a Newton method. As such, the increased 
cost of evaluating the target residual using a 2N quadrature rule is amortized over the linear solve, where 
approximate methods may be employed. By switching to collocation in the linear solve we can effectively 
provide both stability and efficiency while sacrificing very little in terms of convergence rate. 

V. Temporal Discretization 


V.A. Explicit 

To introduce the notation and demonstrate the high-order performance, we consider an explicit time-stepping 
scheme. This scheme also forms the basis of the preconditioner for the implicit scheme discussed in the 
following section. Using the method of lines, the semi-discrete form of Eq. (6), is written as: 

M^+R{U) = 0 (25) 

ot 

where U is the vector of spatial degrees of freedom, R(U) is the spatial residual and M is the mass matrix: 

% = E / ( 26 ) 

Equation (25) is a coupled system of ODEs which is solved numerically using an explicit time-stepping 
scheme. Each stage of an explicit scheme requires inversion of the mass matrix. In general, a CG dis- 
cretization has a mass-matrix which is globally coupled, while a DG formulation will have an element-wise 
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block-diagonal mass matrix. In either case, the cost of the storing, factoring and inverting the mass matrix 
is prohibitive, especially at high-order. However, using our collocation approach, the mass-matrix reduces 
to an easily invertible diagonal matrix for either CG or DG formulations. 

The use of GLL collocation to compute the mass matrix is inexact even in the case of elements with 
constant Jacobians, as the GLL quadrature with N points is exact only for polynomials up to order 2 N — 3. 
In a CG formulation a GLL basis ensures that degrees of freedom are associated with edges, faces and nodes, 
while use of a GL basis necessarily couples all degrees of freedom in the entire domain to ensure continuity. 
On the other hand in a DG formulation, where the degrees of freedom are associated with elements, the use 
of a GL collocated basis provides increased accuracy with minimal additional cost. 

In order to demonstrate the numerical efficiency of different formulations we compute the isentropic 
convection of a two-dimensional vortex described in Section III. A at a Mach number of 0.5. Figure 9 shows 
the L 2 error after 10 flow through periods. We use the DG collocation method with a classical 4th-order 
explicit Runge-Kutta time-stepping scheme. For this problem the temporal error is orders of magnitude 
lower than the spatial error, and the error is determined by the spatial error alone. As shown in Fig. 9, 
the use of GL quadrature is more accurate than GLL quadrature for a fixed number of degrees of freedom. 
The mass lumping with GLL quadrature reduces the order of accuracy of the scheme, consistent with the 
observation in [22], where the error is shown to act as a filter. As such, the effect of mass lumping is more 
significant at low order. We note that the reduced accuracy of using the GLL formulation is due solely to the 
mass-lumping and not to the reduced quadrature in the integration of the spatial part of the residual. This 
was verified by using more accurate quadrature for the spatial residual term, which gave results identical to 
those using collocation. 

As we are primarily concerned with efficiency, it is necessary to consider any additional cost of using 
GL quadrature versus GLL quadrature. For a single residual evaluation, the cost of GL quadrature is 
slightly more than GLL quadrature, due to evaluation of state or basis functions on element faces which 
requires interpolation using GL points, while are directly available using GLL points. However, for a viscous 
simulation interpolation of gradients is required (for either GLL or GL quadrature), and the additional cost 
of interpolating the state at the same time is minimal (on the order of 1-2 % of a residual evaluation). Thus, 
in terms of a single residual evaluation we view the use of the GL quadrature in a DG discretization as 
removing the unnecessary filtering of the highest modes in a colocation scheme, at essentially no additional 
cost. 



h=V DOF 


Figure 9. Isentropic vortex convection problem, Error vs. h using GL and GLL collocation 


Explicit methods have a maximum time-step which is limited by the CFL condition. For high-Reynolds- 
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number wall-bounded flows the numerical stiffness is greatly increased by the resolution required to accurately 
compute gradients near the wall. While we are interested in computing unsteady flows, and thus our time 
step is also limited by the desired temporal resolution of the physics of interest, the restriction due to stability 
may be much more severe, particularly with increasing order of accuracy. 

In order to demonstrate the stiffness associated with higher-order methods, we evaluate the maximum 
stable CFL number using the classical 4th-order explicit Runge-Kutta scheme for two model problems. While 
we are interested in solving viscous flows, the inviscid problem is representative of regions in the flow field 
far from the wall where viscous effect do not play a significant role. The viscous simulation, performed using 
a mesh size with cell Reynolds number, Reh = ^7 = 0(1), is representative of near wall regions of the flow 
field where fine mesh resolution is required to resolve steep gradients in the flow field. 

Figure 10 presents the maximum stable CFL number versus solution order N for the sample test problems. 
We compute the CFL number based on the acoustic speed CFL = where At is the time step, h = \J DOF 
is the resolution length scale and c the free-stream speed of sound. For the inviscid isentropic convection 
problem, the maximum allowable time step scales as 1/iV for a fixed number of degrees of freedom. When 
the cell Reynolds number is 0(1), the maximum stable time step scales as 1/N 2 for a fixed number of degrees 
of freedom. The scaling of the maximum allowable time step with N implies an increase in the number of 
time steps (and hence cost) which scales linearly for inviscid problems and quadratically for viscous problems 
as we increase the spatial resolution via increasing p. From this we see that our total computational cost 
estimate outlined in the introduction increases by N 2 with an explicit method, as our time-step is limited by 
our most restrictive stability limit, making the simulations prohibitive for higher-order methods and negating 
their advantage. 



Figure 10. Maximum CFL vs. solution order for the inviscid isentropic vortex convection problem and the viscous 
Taylor-Green vortex problem at Reh = 0(1). 


V.B. Matrix-free Implicit 

Implicit methods have the possibility for greater efficiency than explicit methods for higher-order methods, 
but require the solution of a globally coupled nonlinear problem at each iteration. In order for implicit 
methods to be competitive, the additional cost of solving this globally coupled problem must be offset by 
a sufficient decrease in the computational work relative to explicit methods. The memory required for the 
linearization of the residual grows rapidly with solution order N. Table 1 gives estimates of the memory 
required to store the full Jacobian matrix for the FAITH problem described in Section I. For reference, 
the state vector at a single time-step requires 40Gb of storage. Clearly, for the class of problems we are 
considering the cost of storing this Jacobian matrix is prohibitive. 

We choose a scheme with memory requirements similar to an explicit scheme, i.e. we trade the memory 
for increased operations which can be hidden via optimization and algorithm choice. As such, we consider 
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N 

Total Memory (Tb) 

2 

11 

4 

89 

8 

720 

16 

5700 


Table 1. Estimated memory usage for storing the Jacobian for the FAITH simulation using 10 9 degrees of freedom. 


Jacobian-free Newton-Krylov methods 23 for the solution of the nonlinear problem arising at each iteration, 

R*(U n+1 ) = 0 (27) 

where R*(U ) is the unsteady residual. In a Newton-Krylov method, Eq. (27) is solved using a Newton 
iteration, where each Newton step requires the solution of a linear system: 

U k = ~R*(U k ) (28) 

The linear system, Eq. (28), is solved using a Krylov method. In this work, we use a preconditioned 
Generalized Minimal Residual (GMRES) algorithm as our Krylov method. 24 In a Krylov method, the 
matrix is not required; we only require the application of to a vector V, i.e. we need to compute 
the Frechet derivative of R* in the GMRES search direction. 

In this work we consider two approaches for computing the Frechet derivative. The first approach, is to 
approximate the linearization using finite differences: 

DR\^ R*(U + eV)-R*(U) 

dU e [ ’ 

with suitable choise of step size e. 25 This approach allows the computation of a Frechet derivative for the 
cost of a single residual evaluation. A particular advantage of this approach is that the approximate Frechet 
derivative is available once the residual has been computed, and requires no linearization of residual terms. 
This is particularly advantageous for turbulence models, as hand linearization of various modeling terms is 
not required. On the other hand, the finite-difference computation of the Frechet derivative is sensitive to 
the step length e and inaccuracies in the Frechet derivative degrade the convergence of the Newton solver. 23 

An alternative approach is to compute the exact Frechet derivative. The exact Frechet derivative corre- 
sponds to the linearized form of Eq. (6). As with the residual, the linearized forms of Eqs. (8) and (9) are 
computed as a sequence of three steps: 

1. Evaluation of the state, gradient, linearized state and linearized gradient ([/, V£7, V and VV) at the 
quadrature point. 

2. Evaluation of the linearized fluxes (EYm = § jjV + VV) at the quadrature points. 

3. Multiplication of the linearized fluxes with the gradient of the basis functions. 

This approach is more expensive than the finite-difference approach, however, the additional cost eliminates 
the dependence on e, and is consistent with the solution of adjoint (dual) problems. In computing the exact 
Frechet derivative, we reuse the optimized kernels described in Section II for computing the sum factorization, 
though we require the additional hand linearization of the fluxes. In our implementation the cost of the exact 
Frechet derivative is approximately 50% more than a residual evaluation, though some of this cost is offset 
by improved convergence. 

We demonstrate the benefits of an implicit method by simulating the Taylor-Green vortex evolution 
at low Reynolds numbcr(i?e = 16) using 8th- and 16th-order spatial discretizations and 64 DOF in each 
coordinate direction. We use a 4th-order, 5-stage, diagonally-implicit Runge-Kutta (DIRK) scheme as our 
time-integration scheme. At each stage, the Newton-Krylov algorithm is used to reduce the unsteady residual 
by 10 orders of magnitude, where at each Newton step the GMRES method is run until the linear residual 
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has been reduced by 6 orders of magnitude or a maximum of 20 Krylov vectors are used. We use the 
mass-matrix as the preconditioner to GMRES. 

The cost of the implicit (DIRK) scheme relative to the 4-stage explicit Runge-Kutta method is presented 
in Fig. 11. We plot the cost as a function of time-step relative the explicit method and as a function of 
the CFL number. For this viscous test case, with Reh = 0(1), the maximum stable CFL for the explicit 
method scales as 1/N 2 . The implicit method allows us to run at larger CFL number as the time-step is not 
restricted due to stability, however, we limit the CFL to be 0(1) in order to maintain temporal accuracy 
of the simulation. The cost reported is the number of residual and Frechet derivative evaluations used for 
the implicit scheme, divided by the number of residual evaluations required for the explicit scheme. At a 
CFL number which corresponds to the largest stable timestep for the explicit scheme, the implicit scheme 
would required significantly more than the explicit scheme. As expected, by increasing implicit timestep the 
cost of the implicit scheme decreases relative to the explicit scheme. At CFL ss 1, the implicit method is 
more efficient than the explicit method, requiring 15-25% of the residual evaluations for the explicit scheme 
even though we are still only using the mass-matrix as a preconditioner. Increasing the CFL number much 
beyond CFL ss 1 causes the GMRES algorithm to stall and a stronger preconditioner is required. 




(a) (b) 

Figure 11. Computational cost of using diagonally-implicit Runge-Kutta method relative to explicit Runge-Kutta 
method for the Taylor-Green vortex problem (M = 0.1, Re — 16) using N — 8 and N = 16 with h = 1/y/DOF = 1/64. 


V.C. Preconditioning 

Preconditioning is necessary for stiff problems in order of obtain good performance in a Newton-Krylov 
scheme. 23 An effective preconditioner approximates the inverse of the Jacobian matrix and has the effect 
of clustering the eigenvalues of the preconditioned system, allowing for rapid convergence of the Krylov 
scheme. 23 We require preconditioning methods with minimal memory overhead as storage of a linearization is 
prohibitive. To this end, we consider an element-wise block-Jacobi preconditioner where the elemental blocks 
are solved approximately using an alternating-direction-implicit (ADI) scheme. 26 We use a diagonalized 
ADI scheme which has previously been used in finite-difference simulations, 2 ' but to our knowledge not in 
finite/spectral-element methods. 

We derive the diagonalized ADI scheme by considering a constant coefficient problem: 

r)u 

^ + V-(Au) = 0 (30) 

Applying a DG discretization with upwind flux, we have; 

?U{“’S“ Vu ’ Au } + l“’^ r “} = 0 (31) 

where A n u is the upwind flux evaluated as: 

A n u = \A n (u + + u~) + \\A n \(u + -u~) = A+u + + A~u~ (32) 
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where A n = A ■ n, while u + and «_ denote, respectively, trace values taken from inside or outside the 
element k. Applying a backward Euler time integration scheme, the contribution to Eq. (31) due to w and 
u restricted to an element k is given by: 


r* K {w\n,u\ K ) = J 


— u n+1 - Vw • Au n+1 
At 


wAIu 


+ ~.n+l 


f 8 k, 


(33) 


In Eq. (33), r* K (w\ K ,u\ K ) corresponds to an element-wise block diagonal entry of the Jacobian . Using 
a tensor product basis in two-dimensions we write u = Uij(f>i(^i)(j)j(^ 2 } and w = <^ m (£i)0n(£2)- Integrating 
in the reference element, the corresponding discrete form is given by: 

Fin* r r ( i 

"ID D / ’ 1 1 ^ 1 J 

bm 0n 7 02 0j 


dU z 


' £l •'& 

L 


At 


- (frrJ^^A?, faepj )• | J| 




(bj I I J* ^ | A J 4^i4^j | 'I| 

\ 1*^1 [ “b / ^ m ^ n -^-2 1*^1 [ 

J £i =1 J £ x J £i =1 


(34) 


1 dR* K 
W\dU. k 

where 


A, A?-- 

= |(Ai ± Ai|) and A 2 -- 

= |(A 2 ±|A 2 |). 

Equation (34) is written in matrix form as: 

^Mi <8 

T 8M3 )_ 

(d[ g Ai 

g M 2 ) - (Ml c 

?> A 2 g O 2 ) 


( 35 ) 

+ (©?■ 

“ g A]) <g>M 2 ) 

+ (Mi g 

A 2 g©f~) + 

(of + gA[g m 2 ) 

+( 

Mi g A 2 g 

Mi = J 

0m 02 

u 

Di= l 

^0m , 

<Pi 

1 

Df : ± = { 4 > m<t>i } €l = 

=±1 

(36) 

M 2 = 

to 

-e- 

-e- 


f 


=±1 

(37) 


In Eq. (35), we have assumed that the Jacobian of the mapping from element reference space to physical 
space is constant. The non-constant Jacobian case is addressed below. In order to clarify the derivation, we 
define operators ©i and O 2 which correspond to the action of interior and boundary convection operators in 
the £1 and £2 directions, respectively, such that we may rewrite Eq. (35) as: 


1 dR* K 
I J\ dU K 


I 

— <8 

At 

I 

3 — 
At 

g> A 1 < 




) Ai 


1 A\ 


At 


,)-(» 


I 

At 


-1 


I 

At 


«)-(' 


Following [26] the second term in Eq. (39) is omitted resulting in the approximate inverse: 


\J\ 


dR* 

dUl 


I 

At 


0 1 


— (g) M 2 I x 


I 

At 
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I <g 


H 

1 ) g/j 


1 g a 2 < 
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1 -1 


1 

At 


— g M? - A 


(^A.2 g) ©2^ 


I 

At 

-1 


/ x 


(40) 


(41) 


Here we have factored terms in Eq. (41) to highlight that the inverses correspond to the solution of one- 
dimensional problems in either £1 or £2 directions. The application of the ADI scheme can be computed 
using the following sequence of steps: 
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1. Solve N independent one-dimensional problems in the £1 direction. 


2. Scale by the time-step. 

3. Solve N independent one-dimensional problems in the £ 2 direction. 


The one-dimensional problems in steps 1 and 3 have m coupled flow equations. Following Pulliam and 
Chaussee 27 a diagonal form of Eq. (41) is obtained using an eigenvalue decomposition of A. We write 
Eq. (41) as: 


\J I 



At 
7® 

(7 <g> Ti <g> 7) 

' 7 , 


TpAjTi) 


-1 


7 

At 


- (T^K 2 T 2 - 


At 


1 A 1 ) 


7® — ® 7 ) x 
-1 


TV T' 


1 

At 


7 x 


— ®7) ^(A 2 ®D 2 ) 


-1 


(7 ® T 2 1 ® 7) 


(42) 


where Aj and A 2 are eigenvalues of Ai and A 2 with corresponding eigenvector T- { and T 2 . The diagonalized 
ADI scheme, corresponding to the action of the approximate inverse in Eq. (42), is summarized in the 
following steps: 

1. Transform to characteristic variables in the £1 direction at each collocation point. 

2. Solve N x m independent scalar one-dimensional advection problems in £1 direction. 

3. Transform to characteristic variables in the £ 2 direction and scale by the time-step. 

4. Solve N x m scalar one-dimensional advection problems in £ 2 direction. 

5. Transform back to original variables 

We note that in the diagonalized ADI scheme, we solve one-dimensional scalar advection problems, as 
opposed to a system of equations, which results in a savings without significant loss of convergence rate. 
Specifically, the diagonalized scheme requires forming and factoring 3 N x N systems corresponding to 3 
distinct eigenvalues as opposed to a 5 TV x 5 N system for the full ADI scheme. This reduces the cost by 
approximately a factor of 8. 

In case of variable coefficients (or variable Jacobian) the linearization, can not be expressed simply 
as the sum of tensor product of matrices. However, we may still apply the steps of the ADI or diagonalized 
ADI algorithms, recognizing that the one-dimensional problems now have variable coefficients. We note that 
while for constant coefficient problems the ADI scheme and diagonalized ADI scheme are mathematically 
equivalent. In the variable coefficient case the diagonalized scheme introduces an additional O(At) error. 2 ' 

Applying the diagonalized ADI scheme for the solution of the isentropic vortex convection problem at a 
Mach number of 0.5 with h = \J DOF = 1/64, Figure 12 shows the convergence history for a typical stage 
of the implicit scheme at CFL = 1.0 for 8th- and 16th-order solutions. The linear residual is plotted as a 
function of the number of residual or Frechet derivative evaluations using both mass-matrix and diagonalized 
ADI preconditioning. The diagonalized ADI scheme reduces the number of residual evaluations required by 
about a factor of two for this problem. As the cost of applying the diagonalized ADI scheme is the same as a 
single residual evaluation, the total CPU time for the diagonalized ADI scheme is same as using mass-matrix 
preconditioning. However, using the diagonalized ADI scheme has the benefit that fewer Krylov vectors are 
required. 

The diagonalized ADI scheme is an effective low-memory preconditioner for convection dominated prob- 
lems. In order to develop an efficient preconditioner for viscous dominated regions we intend to leverage our 
spectral-element framework and pursue spectral/p- multigrid methods. Spectral/p- multigrid methods have 
been widely used for CG spectral-element discretizations where theoretical and numerical results gives p- 
independent convergence rates for elliptic problems. 28,29 While no corresponding convergence theory exists 
for DG discretizations, spectral/p- multigrid approaches have also been widely applied to DG discretiza- 
tions. 30-33 
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(a) N = 8 (b) N = 16 

Figure 12. Residual convergence history for a single Newton-Step using diagonally-implicit Runge-Kutta scheme for 
the isentropic vortex convection problem with h = 1 /y/ DOF — 1/64. 


V.D. Space-Time formulation 

The implicit method with preconditioning provides an efficient scheme for higher-order unsteady problems. 
We can further improve this scheme by the use of a space-time formulation to integrate forward in time. 
The use of a space-time formulation enables us to adapt locally in both the spatial and temporal directions, 
similar to AMR methods with sub-cycling. Here the temporal spectral order (and hence the number of DOF) 
is controlled, leading to a significant reduction in cost as the increased resolution in time is only applied 
where necessary. 

The space-time formulation is contrasted with the implicit Runge-Kutta scheme graphically in Figure 
13. In an implicit RK scheme each stage is solved sequentially, requiring at each stage an implicit solve 
corresponding to a problem which is globally coupled in space at a single point in time. In a space-time 
formulation the solution is computed at the temporal collocation points. However, unlike the implicit scheme 
all temporal collocation points within a time-slab are solved concurrently in a single globally coupled problem. 
Figure 13 also depicts space-time adaptation, where different temporal solution orders are used in different 
elements, potentially reducing the total cost. 



Slab 


«i 


^2 «3 

(b) Space-Time 


Figure 13. Graphical representation of implicit and space-time formulations. The implicit scheme requires a globally 
coupled solve at each stages, while the space-time formulation requires a globally coupled solve over the entire space- 
time slab. 


Space-time formulations are generally considered too expensive for practical engineering simulations, as 
they require the solution of a globally coupled system of equations for each time-slab. As we have noted, 
the cost of storing the linearization for a single step of an implicit scheme is prohibitively expensive. Using 
a space-time formulation this storage cost is scaled by the order of the basis used in the temporal direction. 
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In the previous section, we demonstrate an implicit scheme with similar storage requirement as an explicit 
method using a matrix-free approach and diagonalized ADI preconditioner. Leveraging this full matrix-free 
approach allows us to afford the storage and efficient solution of the space-time formulation. 

Define a time interval (time-slab) I n = [£™,£™ +1 ], A space-time finite element space consisting of 
piece-wise polynomial functions in both space and time on each element k is given by: 

V h = {w,w\ K e[P(KX I)] m } (43) 

We seek a solution u £ Vh satisfying the weak form of Eq. (1): 

+ j ^w(C +1 )u(t n _ +1 ) - w(t n + )u(t n _)^ = 0 VwGVh. (44) 

As with our spatial discretization, we use a tensor product basis: 

u(x(£),t( T )) = Y, Uijk&ijkl ( 45 ) 

where £(r) defines a mapping from the reference interval r £ [—1,1] to the interval / = [£™,£™ +1 ]. The 
integrals in the space-time domain are computed using collocation and the sum-factorization approach, 
leading to a residual evaluation cost which scales as N x (d + 1). Equation (44) gives a system of nonlinear 
equations of the form R* (U) = 0, which must solved for each time slab. For a given spatial discretization, 
the size of system which must be solved in each time slab is now N times larger than that for the implicit 
scheme, as the space-time system involves all temporal collocation points in a space-time slab. We note 
that this results in an increase relative to the implicit scheme in the memory required for the storage of the 
Krylov vectors in GMRES. In the current implementation we can easily absorb this cost. 

The efficiency of our space-time formulation hinges on the ability of the Newton- Krylov scheme to rapidly 
converge the nonlinear problem arising at the time slab. We demonstrate the feasibility of space-time 
approach using a simple mass-matrix preconditioning for the viscous Taylor-Green problem at Re^ = 0(1). 
We solve the viscous Taylor-Green vortex problem at M = 0.1 and Re = 16 using 8th- and 16th-order 
spatial discretizations using 64 degrees of freedom in each coordinate direction. We use 8th- and 16th-order 
solutions respectively in the temporal direction. For each time-slab, the Newton-Krylov algorithm is used to 
reduce the unsteady residual by 10 orders of magnitude. For each Newton iteration we use the mass-matrix 
preconditioned GMRES method with 20 Krylov vectors. 

Figure 14 shows the cost of the Taylor-Green simulation relative to the explicit scheme as a function of 
the space-time CFL number CFL = cJN [ t A * and the relative number of explicit time steps. In the accounting 
in Figure 14 the cost of a space-time residual evaluation is N times that of a single spatial residual. As 
the space-time formulation does not have a time-step limited by stability constraints, we are able to use 
much larger time-steps than the explicit scheme. At CFL = 0(1) the space-time formulation using a simple 
mass-matrix preconditioner has a cost similar to the explicit scheme. We emphasize that our space-time 
formulation has similar memory requirements as an explicit scheme, and using only a very simple mass- 
matrix preconditioning has equivalent computational cost for viscous dominated problems. We anticipate 
that for wall bounded flows with improved preconditioning the space-time formulation will provide significant 
efficiency gain relative the explicit scheme. 

VI. Summary and Future Work 

This paper summarizes a CG and DG spectral-element framework for the simulation of compressible 
high-Reynolds-number flows in the absence of discontinuities. The focus of the work is maximizing the 
efficiency of the computational schemes in terms of memory and epu time to enable unsteady simulations 
with a large number of space-time DOF. A collocation scheme combined with optimized computational 
kernels for the Intel SIMD extensions for x86_64 micro-architectures provides a residual evaluation with 
computational cost independent of order of accuracy through 16th order, and competitive with existing low- 
order production computing codes. This optimized residual framework is leveraged to provide an efficient 
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(a) (b) 

Figure 14. Computational cost of using space-time formulation relative to explicit Runge-Kutta method for the Taylor- 
Green vortex problem (M — 0.1, Re — 16) using N — 8 and N = 16 with h — 1/y/DOF = 1/64. 


implicit scheme based on a matrix-free Newton-Krylov nonlinear solver. A novel preconditioner based on the 
finite-difference diagonalized ADI scheme was developed which maintains the low memory of the matrix-free 
implicit solver, while still providing improved convergence properties for practical problems. Finally, the 
emphasis on low memory usage throughout the solver development was leveraged to implement a coupled 
space-time DG solver to efficiently provide spectral-element h—p adaptive capability in both space and time. 

While there still remain some open items within the current documented capability, such as extending 
the ADI preconditioning to space-time, the core functionality is designed and implemented. The next 
step is to apply the new capability to benchmark turbulence problems, such as the Taylor-Green evolution, 
isotropic turbulence and decay, and the wall-bounded planar channel flow, before progressing to more complex 
separated flows. Further work is aimed at improving the existing capability further, primarily through the 
planned use of a spectral-multigrid solver, and development of an appropriate closure model. We view the 
closure modeling as coupled to the choice of stabilization strategy as in a variational multiscale framework 
both numerical stabilization and the subgrid scale model approximate the effect of the unresolved modes 
on the resolved modes . 1,34 Thus, we will consider alternative stabilization strategies such as Streamline- 
Upwind/Petrov-Galerkin (SUPG ), 35,36 Galerkin-Least-Square (GLS ) 37 or multiscale 38 methods within our 
finite-element framework. 
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